Step 02 Data Wrangling

Introduction

In the previous step 01 Problem statement, we have defined the goal of this project and just have downloaded hte raw data.

In this step 02 Data Wrangling, we clean up the data as much as we can to make it ready for the next step EDA(Exploratory Data Analysis).

The major tasks will be to check the data type used, missing values, and transform the style as necessary.

Imports

In [1]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from IPython.display import clear_output

import os
from library.sb_utils import save_file

Load data

In [2]:
dfm = pd.read_csv('../data/step1_dfm.csv')
dfr = pd.read_csv('../data/step1_dfr.csv',parse_dates = ['reviewTime'])

# convert dataframe values to string
dfm = dfm.astype(str)
In [3]:
# remove the html tags
# http://www.compjour.org/warmups/govt-text-releases/intro-to-bs4-lxml-parsing-wh-press-briefings/
def remove_html_tags(raw_html):
    from bs4 import BeautifulSoup
    cleantext = BeautifulSoup(raw_html, "lxml").text
    return cleantext

check duplicates

In [4]:
# Regarding meta data, check for duplicates
dfm.duplicated().sum(),dfr.duplicated().sum()
Out[4]:
(21927, 243442)

check null

In [5]:
# Any null entries
dfm.isnull().values.any(),dfm.isnull().values.any()
Out[5]:
(False, False)

There seems to be no missing data, but the current missing values are filled with empty, which saves time in the following process. So, we go ahead as they are now.

meta data

In [6]:
dfm.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 203766 entries, 0 to 203765
Data columns (total 19 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   category         203766 non-null  object
 1   tech1            203766 non-null  object
 2   description      203766 non-null  object
 3   fit              203766 non-null  object
 4   title            203766 non-null  object
 5   also_buy         203766 non-null  object
 6   tech2            203766 non-null  object
 7   brand            203766 non-null  object
 8   feature          203766 non-null  object
 9   rank             203766 non-null  object
 10  also_view        203766 non-null  object
 11  main_cat         203766 non-null  object
 12  similar_item     203766 non-null  object
 13  date             203766 non-null  object
 14  price            203766 non-null  object
 15  asin             203766 non-null  object
 16  imageURL         203766 non-null  object
 17  imageURLHighRes  203766 non-null  object
 18  details          203766 non-null  object
dtypes: object(19)
memory usage: 29.5+ MB

category column

In [7]:
dfm.category.value_counts().to_frame()
Out[7]:
category
['Movies & TV', 'Movies'] 30831
['Movies & TV', 'Genre for Featured Categories', 'Action & Adventure'] 10761
['Movies & TV', 'Genre for Featured Categories', 'Drama'] 9639
['Movies & TV', 'Genre for Featured Categories', 'Documentary'] 8843
['Movies & TV', 'Genre for Featured Categories', 'Kids & Family'] 8029
... ...
['Movies & TV', 'Charlie Rose Store', 'Entertainment', 'Television'] 1
['Movies & TV', 'Independently Distributed', 'Documentary', 'Made in USA', 'Woven Badge', 'Ribbed back neck and cuffs', 'Printed front with plain back'] 1
['Movies & TV', 'Studio Specials', 'Sony Pictures Home Entertainment'] 1
['Movies & TV', 'African American Cinema', 'Breakthrough Cinema'] 1
['Movies & TV', 'Fully Loaded DVDs', 'Five Star Collection'] 1

715 rows × 1 columns

It is found that the data in category has some levels although the first level, 'Movies & TV', will not be necessary

In [8]:
# check what is in 'Special Interests' category
dfm[dfm.category.str.contains('Special Interests')].category.unique()
Out[8]:
array(["['Movies & TV', 'Genre for Featured Categories', 'Special Interests']",
       "['Movies & TV', 'Boxed Sets', 'Special Interests', 'Art & Artists']",
       "['Movies & TV', 'Independently Distributed', 'Special Interests']",
       "['Movies & TV', 'Boxed Sets', 'Special Interests']",
       "['Movies & TV', 'Boxed Sets', 'Special Interests', 'History']",
       "['Movies & TV', 'Boxed Sets', 'Special Interests', 'General']",
       "['Movies & TV', 'Boxed Sets', 'Special Interests', 'Nature & Wildlife']",
       "['Movies & TV', 'Studio Specials', 'Warner Home Video', 'Warner Video Bargains', 'Special Interests']",
       "['Movies & TV', 'Boxed Sets', 'Special Interests', 'Travel']",
       "['Movies & TV', 'Boxed Sets', 'Special Interests', 'Fitness']",
       "['Movies & TV', 'Boxed Sets', 'Special Interests', 'Health']",
       "['Movies & TV', 'Boxed Sets', 'Special Interests', 'Religion & Spirituality']",
       "['Movies & TV', 'Boxed Sets', 'Special Interests', 'Yoga']",
       "['Movies & TV', 'Boxed Sets', 'Special Interests', 'Dance']",
       "['Movies & TV', 'Boxed Sets', 'Special Interests', 'Cooking & Beverages']",
       "['Movies & TV', 'Genre for Featured Categories', 'Special Interests', 'Manufactured to the Highest Quality Available.', 'With True Enhanced Performance.', 'Latest Technical Development.']",
       '[\'Movies & TV\', \'Genre for Featured Categories\', \'Special Interests\', "In this video you will see demonstrations from a group of Hollywood\'s finest artists as they create a prosthetic makeup including false teeth and hair Also includes an interview and career profile of makeup legend Tom Burman.", \'While Supplies Last As always\']'],
      dtype=object)

I thought that the parents of 'Special Interests' is identical, but they are different and there is no consistency to avoid duplicated naming in the category structure.

In [9]:
# Convert the list-like string to list
from ast import literal_eval
dfm.category = dfm.category.apply(lambda x: literal_eval(x))
In [10]:
max_cat = dfm.category.apply(len).max()
max_cat
Out[10]:
7

The maxium #level of categories is 7.

In [11]:
dfr.overall.value_counts(dropna=False)
Out[11]:
5.0    5491586
4.0    1498047
3.0     735907
1.0     633153
2.0     406875
Name: overall, dtype: int64
In [12]:
dfr[dfr.overall=='']
Out[12]:
overall reviewTime reviewerID asin
In [13]:
# for i in range(max_cat):
#     dfm[str('cat'+str(i+1))] = dfm['category'].apply(lambda x: x[i] if i<len(x) else '')
In [14]:
# remove 'Movies & TV' as this meta and review dataframe is only ordinally from 'Movies & TV' category, and it is unnecessary
dfm['category'].apply(lambda x : x.remove("Movies & TV") if "Movies & TV" in x else x)

# print the unique values and its count
categories = dfm.category.value_counts().to_frame()

categories = categories.reset_index()

categories.columns = ['cat','num_cat']
categories.cat = categories.cat.apply(str)

# Number of movies per category
sns.catplot(data = categories.iloc[:10], x = 'num_cat', y = 'cat', kind = 'bar', orient= 'h', color="skyblue");
In [15]:
# stock = []

# for i in range(len(categories)):
#     stock.append(dfm.iloc[i].category[0])
# set(stock)

main_cat

In [16]:
dfm.main_cat.value_counts().to_frame()
Out[16]:
main_cat
Movies & TV 203464
Sports & Outdoors 118
Books 35
Amazon Home 33
<img src="https://images-na.ssl-images-amazon.com/images/G/01/digital/music/logos/amzn_music_logo_subnav._CB471835632_.png" class="nav-categ-image" alt="Digital Music"/> 31
Toys & Games 17
All Electronics 10
nan 10
Cell Phones & Accessories 9
Industrial & Scientific 8
Health & Personal Care 7
<img src="https://images-na.ssl-images-amazon.com/images/G/01/nav2/images/gui/amazon-fashion-store-new._CB520838675_.png" class="nav-categ-image" alt="AMAZON FASHION"/> 6
All Beauty 3
Arts, Crafts & Sewing 3
Office Products 2
Pet Supplies 2
Tools & Home Improvement 2
Musical Instruments 1
Home Audio & Theater 1
Camera & Photo 1
Grocery 1
Baby 1
Video Games 1
In [17]:
dfm[dfm['main_cat']!='Movies & TV'].head()
Out[17]:
category tech1 description fit title also_buy tech2 brand feature rank also_view main_cat similar_item date price asin imageURL imageURLHighRes details
793 [Christian Video, Biography] nan ['When Superbook falls onto a computer keyboar... nan Moses &amp; Plagues/Moses &amp; Israelites (Su... [] nan Tyndale [] 5,869,236 in Books ( [] Books nan nan nan 084236837X [] [] nan
794 [Christian Video, Bible] nan ['Two inquisitive children stumble upon an anc... nan Nehemiah &amp; the Wall / Job (Superbook Video... [] nan Visit Amazon's Tyndale Page [] 8,536,790 in Books ( ['B00N37I5R2'] Books nan nan nan 0842368167 [] [] nan
898 [Musicals & Performing Arts] nan ["Learn how to play and practice the classic r... nan Basic Afro-Cuban Rhythms for Drum Set and Hand... [] nan nan [] 3,758,302 in Books ( [] Books nan nan $14.62 0876390300 [] [] nan
1274 [Movies] nan ["The Family Video can be enjoyed as pure ente... nan Family [] nan Memory Lane [] 6,710,279 in Books ( [] Books nan nan $5.49 0976224976 [] [] nan
1498 [TV] nan ["Beverly Hills 90210 - The Complete First Sea... nan Beverly Hills 90210 - The First Season [DVD] [] nan nan [] 2,550,757 in Beauty & Personal Care ( [] All Beauty nan nan nan 1415717214 [] [] nan

They are unnecessary.

In [18]:
dfm = dfm[dfm['main_cat']=='Movies & TV']

description column

In [19]:
# Check the description
dfm.description.value_counts().to_frame().tail(2)
Out[19]:
description
['WITCHOUSE'] 1
["Marco's latest release is his most detailed yet. The DVD includes almost one full hour of drum solos, performance with the Marco Minnemann big band and guitar hero Mike Kneally, several play-along tracks, instructional footage, comprehensive bonus feature"] 1
In [20]:
# check the data in the row of 66812.
dfm.iloc[66812].description
Out[20]:
'[]'

Clean in the same way as above

In [21]:
# Fill empty categories with empty string
dfm['description'] = dfm['description'].apply(lambda x: '' if x == '[]' else x)

# Clean html letter code
dfm['description'] = dfm[['description']].applymap(remove_html_tags)
In [22]:
# confirm the result
dfm.iloc[66812].description
Out[22]:
''
In [23]:
# check other rows
dfm[dfm.description.str.contains('///')].iloc[3].description
Out[23]:
'[\'Journey in China  West Lake Hangzhou DVD Language: Chinese and English Subtitle: English-Chinese Bilingual Code Are: All Regions 16:9 NTSC ISBN: 9787884200153 Introduction: West Lake is located in the western area of Hangzhou City\\\'s historic center. There are dozens of lakes called West Lake worldwide, but "West Lake" usually refers to the Hangzhou West Lake. It is surrounded by mountains on three sides, with an area of around 6.5 square kilometers. The circumference is around 15 kilometers. The average depth of West Lake is 2.27 meters, and the capacity is about 14,290,000 cubic meters. The lake is divided by Gu Shan, Bai, Su and Yanggong Causeways into five areas. Ordered by their areas, they are Outer West Lake (), West Inner Lake (, or , or ), North Inner Lake ( or ), Little South Lake ( or ) and Yue Lake (). "Outer West Lake" is the largest. "Gu Shan" or Gu Hill is the largest natural island in the lake. Su & Bai Causeways run cross the lake. Three small man-made islands, "Xiao Ying Zhou" (), "Hu Xing Ting" (), "Ruan Gong Dun" (), lie in the center of Outer West Lake. Thus, the basic layout is "one hill, two causeways, three islands, and five lakes". /// This is a great product sourced from BIML - Bible In My Language, the leader in foreign language Bibles and outreach materials from Baltimore, Maryland in the USA. BIML stocks Bibles in more than 600 languages.\']'

We still have unnecessary characters such as the \\ or /// but now just leave them.

?? remove the \\ or ///

title column

In [24]:
dfm.title.value_counts().to_frame()
Out[24]:
title
nan 59
Treasure Island 15
Live 11
Mozart: Don Giovanni 10
WWE 10
... ...
Revolution (2012) ( (R)evolution ) [ Blu-Ray, Reg.A/B/C Import - Germany ] 1
Dido &amp; Aeneas: A Choreographic Opera 1
All About John Deere For Kids VHS#4 1
Believe: The Eddie Izzard Story 2010 1
No Man's Land / Ronin / Harley Davidson and the Marlboro Man / Blown Away 1

177707 rows × 1 columns

In [25]:
# Clean html letter code
dfm['title'] = dfm[['title']].applymap(remove_html_tags)

# remove any leading and trailing spaces
dfm['title'] = dfm[['title']].apply(lambda x: x.str.strip() if x.dtype == "object" else x)

# replace nan for ''
dfm['title'] = dfm[['title']].replace('nan', 'no-title')
In [26]:
dfm.title.value_counts().to_frame()
Out[26]:
title
no-title 59
Treasure Island 15
Live 11
Mozart: Don Giovanni 10
WWE 10
... ...
The Goon Show Old Time Radio OTR Mp3 Collection on DVD - Offering 82 episodes with 38 hours of entertainment 1
Remembering September 11th 1
Boy of Two Worlds 1
Girl Who Leapt Through Space 2 1
No Man's Land / Ronin / Harley Davidson and the Marlboro Man / Blown Away 1

177513 rows × 1 columns

brand column

In [27]:
dfm.brand.value_counts().to_frame()
Out[27]:
brand
nan 66132
Various 3178
. 1041
\n 434
- 427
... ...
Paola Mendoza 1
Narration by Heather Ryan 1
Jzsef Madaras 1
Kato Buss 1
Grant Withers Frankie Darro 1

55256 rows × 1 columns

In [28]:
# Clean html letter code
dfm['brand'] = dfm[['brand']].applymap(remove_html_tags)

# remove any leading and trailing spaces
dfm['brand'] = dfm[['brand']].apply(lambda x: x.str.strip() if x.dtype == "object" else x)

# replace specific values for ''
dfm['brand'] = dfm[['brand']].replace(['nan','.','-','*','','None'], 'no-brand')
In [29]:
dfm.brand.value_counts().to_frame()
Out[29]:
brand
no-brand 68695
Various 3178
Learn more 358
Sinister Cinema 265
John Wayne 233
... ...
Narration by Heather Ryan 1
Jzsef Madaras 1
Kato Buss 1
Ty Burrell 1
Grant Withers Frankie Darro 1

55246 rows × 1 columns

asin

In [30]:
print('The number of products: {}'.format(dfm.asin.nunique()))
The number of products: 181552

review data

In [31]:
dfr.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8765568 entries, 0 to 8765567
Data columns (total 4 columns):
 #   Column      Dtype         
---  ------      -----         
 0   overall     float64       
 1   reviewTime  datetime64[ns]
 2   reviewerID  object        
 3   asin        object        
dtypes: datetime64[ns](1), float64(1), object(2)
memory usage: 267.5+ MB

overall

In [32]:
dfr[['overall','reviewTime']].describe().T
Out[32]:
count mean std min 25% 50% 75% max
overall 8765568.0 4.233011 1.221438 1.0 4.0 5.0 5.0 5.0

reviewerID

In [33]:
print('The number of reviewer in review data: {}'.format(dfr.reviewerID.nunique()))
The number of reviewer in review data: 3826085

asis

In [34]:
print('The number of products in review data: {}'.format(dfr.asin.nunique()))
The number of products in review data: 182032

Delete unnecessary data

In [35]:
# Delete the rows that only exist in the either review or meta data
dfm = dfm.astype(str).drop_duplicates()
dfr = dfr.astype(str).drop_duplicates()

Save data

This should be done aftre imputation of communityAverage as there are null values in the column.

In [36]:
# save the data to a new csv file
datapath = '../data'
save_file(dfm, 'step2_cleaned_dfm.csv', datapath)
save_file(dfr, 'step2_cleaned_dfr.csv', datapath)
A file already exists with this name.

Do you want to overwrite? (Y/N)y
Writing file.  "../data/step2_cleaned_dfm.csv"
A file already exists with this name.

Do you want to overwrite? (Y/N)y
Writing file.  "../data/step2_cleaned_dfr.csv"

Summary

The key processes and findings from this notebook are as follows.

  • Rows deleted: The rows(asin in the meta data) have been deleted.
  • Columnns deleted: 0 column
  • meta data
    • Category column: Some data which seem unnecessary such as 'Movies or TV' have been removed (possibly they could have been splited into separate columns of 'Movies or TV' and 'Featured genre') It was expected to visualise the structure of category usig node to clarify the structure, but it takes time and not the main purpose here, so category data was remained as it is.
    • Description column: html script has been removed, but some symbols such as '\\' or '///' still remains. Those unnecessary symbols will be removed as necessary in the future steps.